Memory system with region-specific memory access scheduling

ABSTRACT

An integrated circuit device includes a memory controller coupleable to a memory. The memory controller is to schedule memory accesses to regions of the memory based on memory timing parameters specific to the regions. A method includes receiving a memory access request at a memory controller of a memory device. The method further includes accessing, from a timing data store of the memory device, data representing a memory timing parameter specific to a region of the memory cell circuitry targeted by the memory access request. The method also includes scheduling, at the memory controller, the memory access request based on the data.

BACKGROUND

Field of the Disclosure

The present disclosure relates generally to memory devices and relates more particularly to scheduling memory accesses for a memory device.

Description of the Related Art

Memory latency and bandwidth are significant performance bottlenecks in many processing systems. These performance factors typically are impacted by memory cell access times and the propagation delays of signals traversing various interconnects within the memory. In conventional memory controller implementations, a worst-case approach is taken for scheduling memory accesses to a memory subsystem whereby the memory controller assumes fixed worst-case memory timing parameters that apply to the entire memory subsystem.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure may be better understood, and its numerous features and advantages made apparent to those skilled in the art by referencing the accompanying drawings. The use of the same reference symbols in different drawings indicates similar or identical items.

FIG. 1 is a diagram illustrating an example processing system employing a memory with region-specific memory access scheduling in accordance with some embodiments.

FIG. 2 is a diagram illustrating an exploded perspective view of an example vertical-stacked-die implementation of the processing system of FIG. 1 in accordance with some embodiments.

FIG. 3 is a diagram illustrating an exploded perspective view of another example vertical-stacked-die implementation of the processing system of FIG. 1 in accordance with some embodiments.

FIG. 4 is a diagram illustrating an example memory controller employing a timing profiler in accordance with some embodiments.

FIG. 5 is a flow diagram illustrating a method for profiling a memory timing parameter of a memory in accordance with some embodiments.

FIG. 6 is a timing diagram illustrating a process of scheduling memory accesses based on region-specific memory timing parameters in accordance with some embodiments.

FIG. 7 is a flow diagram illustrating a method for designing and fabricating an integrated circuit (IC) device implementing a memory controller to schedule memory accesses based on region-specific memory timing parameters in accordance with some embodiments.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

FIGS. 1-7 illustrate example techniques for reducing latency and improving bandwidth and power efficiency of a memory subsystem of a processing system or other electronic device through the use of memory access scheduling based on region-specific memory timing parameters. Various factors, such as variations in the semiconductor fabrication process, device temperature, and aging effects, can cause the memory cell access times and signal propagation times to vary significantly for memory accesses to different regions of a memory, which becomes particularly pernicious when the supply voltage scales down to the transistor threshold voltage (Vth) level for purposes of power reduction. In conventional systems, these performance variations are typically managed by selecting conservative timing parameters based on worst-case scenarios and using these same conservative timing parameters for each memory access, regardless of the region targeted by the memory access. In contrast, in some embodiments, a memory controller more efficiently schedules memory accesses by profiling one or more memory timing parameters for each region of multiple regions of a memory, saving the measured timing parameters to a timing data store, and then scheduling memory accesses to the memory based on the profiled region-specific memory timing parameters associated with the regions targeted by the memory accesses. Thus, the memory controller can reduce latency and improve power efficiency of a memory subsystem by exploiting the performance variations among different regions of the memory such that the timing and sequencing of the memory accesses reflect the different memory timings exhibited by the different memory regions. The granularity of the profiled memory regions can include, for example, sets of one or more cells, rows, banks, ranks, chips, memory modules (e.g., dual inline memory modules or "DIMMs"), channels, and the like.

Stacked-die, or three-dimensional (3D), dynamic random access memory (DRAM)-based memory subsystems often are considered for use as large on-chip caches. However, due to certain aspects of the DRAM architecture, due to die-to-die process variation, and due to the interconnect configurations resulting from the stacked arrangement of the memory dies in a stacked-die memory, such memory subsystems often exhibit significant region-to-region variations in memory timing. This, in conjunction with the ability to locally store memory timing information on a region-by-region basis, makes stacked-die memory well suited for region-based memory access scheduling. Accordingly, example region-based memory access scheduling techniques are described herein in the example context of stacked-die DRAM memory. However, these techniques are not limited to this example context, but are equally applicable to any of a variety of stacked or non-stacked memory architectures including, but not limited to, static random access memory (SRAM), ferroelectric RAM (F-RAM), magneto-resistive RAM (MRAM), phase-change RAM, resistive RAM, Flash memory, and the like.

FIG. 1 illustrates an example processing system 100 implementing region-based memory access scheduling in accordance with some embodiments. The processing system 100 can comprise any of a variety of processing systems, including a notebook or tablet computer, a desktop computer, a server, a network router, switch, or hub, a computing-enabled cellular phone, a personal digital assistant, a gaming console, and the like. The processing system 100 may be implemented on a single die or on multiple dies, including a plurality of stacked dies as shown in later examples.

In the depicted example, the processing system 100 includes one or more processors 101, one or more memory controllers 102, and a memory array 104. The processor 101 includes one or more processor cores 116, such as one or more central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), and the like. The processor 101 further includes an integrated northbridge 120, an input/output (I/O) controller 122, one or more levels of cache 118, and the like. The northbridge 120 is connected to the memory controller 102 via a host bus 126. The host bus 126 can be implemented in accordance with any of a variety of conventional interconnect or bus architectures, such as a Peripheral Component Interconnect-Express (PCI-E) architecture, a HyperTransport architecture, a QuickPath Interconnect (QPI) architecture, and the like. Alternatively, the host bus 126 can be implemented in accordance with a proprietary bus architecture. The I/O controller 122 can be connected to one or more external devices 124, such as one or more external interfaces to display components, storage devices, input devices (e.g., a mouse or keyboard), and the like.

The processing components of the system 100 are coupled to a memory subsystem comprising the memory controller 102 and the memory array 104. The memory array 104 may implement any of a variety of storage devices including, but not limited to, memory architectures such as dynamic random access memory (DRAM), static random access memory (SRAM), ferroelectric RAM (F-RAM), magneto-resistive RAM (MRAM), and the like. For the purposes of this disclosure, various embodiments are described in the context of DRAM arrays and DRAM memory controllers. In this example context, the memory array 104 comprises a plurality of DRAM arrays 106. As described in detail below, the memory subsystem may be implemented as stacked-die memory devices, and thus the DRAM arrays 106 may be implemented as, for example, separate memory dies within the stacked-die memory.

In operation, the memory controller 102 services memory read and memory write requests issued by the processing elements of the processing system 100. To this end, the memory controller 102 implements controller logic 108 for accessing the DRAM arrays 106 and for interfacing with the host bus 126. In response to memory access requests, the controller logic 108 issues commands to the DRAM arrays 106. In response to a read request for a corresponding address, the controller logic 108 transmits signaling to the DRAM arrays 106 to read stored data at the addressed location. In response to a write request, the controller logic 108 transmits signaling to the DRAM arrays 106 to write data to the addressed location.

The correct operation of each memory transaction is predicated on the controller logic 108 adhering to certain memory timing parameters, such as the RAS precharge, or "tRP", timing parameter, which reflects the time needed between disabling access to one line of data and enabling access to another line of data. In implementation, some or all of the memory timing parameters may vary depending on the region of the memory array 104 being accessed. Accordingly, in some embodiments, the controller logic 108 is configured to buffer, reorder, or schedule memory requests based on region-specific memory timing parameters so as to improve performance and efficiency. To this end, the memory controller 102 also implements profiling logic 110 and a timing data store 112 to determine and store region-specific memory timing information. As described in more detail below with reference to FIGS. 4 and 6, in some embodiments, the profiling logic 110 evaluates each memory region of a set of one or more memory regions of the memory array 104 to determine one or more memory timing parameters specific to that region. The memory timing information for the region then may be maintained in the timing data store 112, which can comprise any of a variety of storage mechanisms, such as a memory array, a set of registers or a cache, a Bloom filter, a content addressable memory (CAM), and the like. The granularity of the regions of the memory array 104 so analyzed can include, but is not limited to, cell-based regions, bank-based regions, row-based regions, channel-based regions, chip-based regions, memory-module-based regions, and the like, as well as combinations thereof.

In response to a trigger event, the profiling logic 110 initiates a test process to measure the corresponding memory timing parameters for each region of the applicable regions of the memory array 104. This trigger event can include, for example, a power-on reset, a lapse of a timer, a command from the processor 101 or other controller, a sensed change in temperature that exceeds a threshold (thereby indicating a likely temperature-induced change in the timing parameters), a trigger initiated during a dynamic voltage and frequency scaling (DVFS) operation, and the like. The test process can include, for example, repeatedly writing a test pattern to the corresponding region and subsequently reading it back while varying the values for the applicable memory timing parameters. The timing parameters also could be updated by performing a lookup to a table that maintains latency values under different temperatures or service times, or that stores representations of formulae that may be used to calculate the latency using temperature and service time as inputs. When reliable minimum memory timing parameters are identified for the region via this test process, the profiling logic 110 can store data representative of these identified memory timing parameters to the timing data store 112. In some embodiments, the timing data store 112 is implemented as a set of registers, a cache, or a CAM, and the data is stored in a location associated with the region. In other embodiments, the timing data store 112 is implemented as a Bloom filter or other compressed data structure, and this data is incorporated into the compressed data structure accordingly.

In the course of typical operation, the memory timing parameter information is accessible from the timing data store 112 by the controller logic 108. In response to receiving a memory access request from the processor 101, the controller logic 108 uses the address targeted by the memory access request to access the memory timing information associated with the region containing the targeted address. With this region-specific memory timing information, the controller logic 108 can manage the timing of memory commands sent to the memory array 104 according to the stored memory timing parameters for a given region of memory so as to more efficiently schedule memory accesses. For example, when a memory read request is received by the memory controller 102, the controller logic 108 identifies the region of memory to be accessed based on the address of the memory read request and then communicates with the profiling logic 110 and the timing data store 112 to determine the memory timing parameters specific to that region. The controller logic then schedules and transmits commands to the DRAM arrays 106 according to the stored timing parameters. As described in greater detail below with reference to FIG. 4, this command scheduling also includes ensuring there are no resource conflicts, such as data bus collisions due to multiple accesses with different timings.
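
To make the lookup flow concrete, the following sketch models a timing data store keyed by memory region. It is a minimal illustration only: the RegionTiming fields, the worst-case defaults, and the address-to-region mapping (here, a few bank-index bits) are assumptions invented for this example, not details prescribed by the disclosure.

```python
from dataclasses import dataclass

@dataclass
class RegionTiming:
    t_rcd: int  # row to column command delay, in cycles
    t_cl: int   # column command to data out, in cycles
    t_rp: int   # precharge time, in cycles

# Assumed worst-case defaults used for regions that have not been profiled.
WORST_CASE = RegionTiming(t_rcd=6, t_cl=6, t_rp=13)

class TimingDataStore:
    """Toy stand-in for the timing data store 112 (a dict, not a CAM)."""

    def __init__(self):
        self._by_region = {}  # region id -> RegionTiming

    def update(self, region, timing):
        self._by_region[region] = timing

    def lookup(self, region):
        return self._by_region.get(region, WORST_CASE)

def region_of(address, bank_shift=13, bank_mask=0x7):
    """Assumed mapping: a few bank-index bits of the address name the region."""
    return (address >> bank_shift) & bank_mask

store = TimingDataStore()
store.update(region=2, timing=RegionTiming(t_rcd=4, t_cl=5, t_rp=10))

request_address = 0x4F00  # falls in region 2 under the assumed mapping
print(store.lookup(region_of(request_address)))
# RegionTiming(t_rcd=4, t_cl=5, t_rp=10)
```

A request's target address selects a region; a profiled entry, if present, overrides the conservative default, mirroring how the controller logic 108 consults the timing data store 112.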

Significant variations in memory timing parameters are often encountered in stacked arrangements of memory dies. Memory controllers implementing region-based memory access scheduling techniques are able to exploit these variations to more efficiently schedule memory requests and are thus well suited to stacked arrangements of memory dies.

FIG. 2 illustrates an exploded perspective view of an example implementation of the processing system 100 as a stacked-die processing system 200 in accordance with some embodiments. In the depicted example, the stacked-die processing system 200 includes a plurality of dies, such as dies 202, 204, 206, 208, 210, and 212 (collectively referred to herein as "dies 202-212"), stacked in the illustrated vertical arrangement and interconnected via through silicon vias (TSVs) 218. In the illustrated example, stacked memory dies 206, 208, 210, and 212 implement the memory cell circuitry 230 that forms the DRAM arrays 106 (FIG. 1) of the memory array 104 (FIG. 1). The dies 206-212 thus are referred to herein as "memory dies." While described in the example context of DRAM, the memory cell circuitry 230 may implement any of a variety of memory architectures including, but not limited to, DRAM, static random access memory (SRAM), read-only memory (ROM), flash memory, ferroelectric RAM (F-RAM), magneto-resistive RAM (MRAM), and the like.

The die 202 implements the processing components of the processor 101 as described above, including one or more processor cores 221 and 222 and a cache 224, and is thus referred to herein as the "processor die 202". The die 204 implements the memory controller 102, which includes the profiling logic 110. The memory controller 102 implements hardwired logic for accessing the memory circuitry of the stacked dies 206-212 as well as for interfacing with the die 202 to service read and write memory access requests. The dies 202 and 204 also can include any of a variety of peripheral devices, such as northbridges, southbridges, input/output controllers, network interfaces, and the like. The dies 202 and 204 thus are referred to herein as "logic dies."

The timing data store 112 may be implemented as a cache, CAM, Bloom filter, or register set on a logic die of the stacked-die processing system 200. To illustrate, the timing data store 112 may be implemented as non-volatile memory (e.g., flash memory) on the die 204 along with the other components of the memory controller 102 to facilitate timely access to the memory timing information by the profiling logic 110 and the controller logic 108 of the memory controller 102. In other embodiments, the stacked-die processing system 200 can take advantage of the sizeable storage capacity of the memory array 104 by implementing the timing data store 112 in the memory cell circuitry 230 of one or more of the memory dies. In yet other embodiments, the timing data store 112 may be implemented using both a logic die and a memory die. To illustrate, the timing data store 112 may be implemented as a small cache on a logic die and a larger cache on a memory die, with the small cache storing the memory timing information for the most recently accessed regions of memory and the larger cache storing the memory timing information for all of the memory regions.

In some embodiments, the components of the processor 101 and the memory controller 102 may be partitioned in other combinations across multiple dies or implemented on a single die. To illustrate, the memory controller 102 and the processor 101 may be implemented on the same die. As another example, the processor 101 may be implemented on one die, the controller logic 108 (FIG. 1) and the memory interface 114 (FIG. 1) implemented on a second die, and the profiling logic 110 and timing data store 112 implemented on a third die.

Stacked memory subsystems typically employ a separate logic layer to implement the memory controller and related logic functions so that a different semiconductor fabrication process can be used to fabricate the logic die 204 than that used to fabricate the memory dies 206-212. Accordingly, the vertical die stack of the processing system 200 may be fabricated using any of a variety of 3D integrated circuit fabrication processes. In one approach, the dies 202-212 each are implemented as a separate substrate (e.g., bulk silicon) with active devices and one or more metal routing layers formed at an active surface. This approach can include a wafer-on-wafer process whereby a wafer comprising a matrix of dice is fabricated and thinned, and TSVs are etched through the bulk silicon. Multiple wafers are then stacked to achieve the illustrated layer configuration (e.g., a stack of four wafers comprising memory circuitry dies for the four memory layers and a wafer comprising the logic die for a logic layer), aligned, and then joined via thermocompression. The resulting stacked wafer set is singulated to separate the individual 3D IC devices.

In a die-on-die process, the wafer implementing each corresponding layer is first singulated, and then the dies are separately stacked and joined to fabricate the 3D IC devices. In a die-on-wafer approach, wafers for one or more layers are singulated to generate the dies for those layers, and these dies are then aligned and bonded to the corresponding die areas of another wafer, which is then singulated to produce the individual 3D IC devices. One benefit of fabricating the dies 202-212 on separate wafers is that a different fabrication process can be used to fabricate the logic layers (dies 202-204) than that used to fabricate the memory dies (dies 206-212). Thus, a fabrication process that provides improved performance and lower power consumption may be used to fabricate the dies 202 and 204 (and thus provide faster and lower-power interface logic and circuitry for the memory controller 102 and the profiling logic 110), whereas a fabrication process that provides improved cell density and improved leakage control may be used to fabricate the dies 206-212 (and thus provide denser, lower-leakage bitcells for the stacked memory).

In another approach, the layers 202-212 are fabricated using a monolithic 3D fabrication process whereby a single substrate is used and each die layer is formed on a preceding die layer using a layer transfer process, such as an ion-cut process. The stacked memory device also may be fabricated using a combination of techniques. For example, the logic layers (dies 202 and 204) may be fabricated using a monolithic 3D technique, the memory layers (dies 206-212) may be fabricated using a die-on-die or wafer-on-wafer technique, or vice versa, and the resulting logic layer stack and memory layer stack then may be bonded together and then bonded to an interposer substrate.

Although the processing system 100 and the processing system 200 each implement a single instance of the memory controller 102, in other embodiments multiple memory controllers may be implemented in order to increase system performance, with each memory controller controlling separate portions of system memory. Each of the multiple memory controllers may implement separate instances of the region-based memory access scheduling techniques described herein.

FIG. 3 is a diagram illustrating an exploded perspective view of an example implementation of the processing system 100 as a stacked-die processing system 300 with multiple memory controllers in accordance with some embodiments. In the depicted example, the stacked-die processing system 300 is configured in a side-split arrangement that includes a die 302 and a vertically-stacked memory subsystem comprised of dies 304, 306, 308, 310, and 312 (collectively, "dies 304-312"). The die 302 implements the components of the processor 101, including one or more processor cores 301 and 303, a cache 305, and the like. In the illustrated example, dies 306-312 implement the memory cell circuitry 330 of the memory array 104 (FIG. 1). A host bus 322 connects the die 302 and the die 304. The die 304 implements two memory controllers 314 and 316 (embodiments of the memory controller 102, FIG. 1) as well as profiling logic 318 and profiling logic 320 (embodiments of the profiling logic 110, FIG. 1). Each of the memory controllers 314 and 316 controls a corresponding independent portion of the attached memory arrays in order to increase memory bandwidth and throughput. The profiling logic 318 operates to maintain memory timing information for the regions of the memory array 104 served by the memory controller 314, and the profiling logic 320 operates to maintain memory timing information for the regions of the memory array 104 served by the memory controller 316.

The depicted processing system 300 may be implemented with a variety of packaging techniques. In some embodiments, the memory subsystem implemented as dies 304-312 may be implemented as a separate vertical stack, horizontally disposed on the surface of an interposer along with the die 302. In this embodiment, the host bus 322 can include conductors implemented in the metal layers of the interposer. In another embodiment, the processor die 302 and the stack of dies 304-312 may be contained in separate integrated circuit (IC) packages. In such implementations, the IC packages may be placed on a circuit board or similar substrate and the host bus 322 implemented using conductors of the circuit board, including pads, pins, traces, vias, solder bumps, and the like.

FIG. 4 is a diagram illustrating a more detailed implementation of the memory controller 102 in accordance with some embodiments. In the depicted example, the memory controller 102 includes a host interface 404, an arbiter 406, profiling logic 410 (corresponding to the profiling logic 110, FIG. 1), a timing data store 412 (corresponding to the timing data store 112, FIG. 1), a command (CMD) input queue 413, a write queue 414, a read queue 415, a scheduler 420, and a physical interface (PHY) 422. The arbiter 406, the queues 413, 414, and 415, and the scheduler 420 together form the controller logic 108 (FIG. 1). A host bus 402 couples the host interface 404 to the processor 101 (FIG. 1), whereby the host interface 404 receives and buffers signaling representative of memory access requests transmitted over the host bus 402 to the memory controller 102. The host interface also transmits memory access request responses to the processor 101 via the host bus 402. The arbiter 406 receives memory access requests from the host interface 404 and from the profiling logic 410, and resolves conflicts for the command input queue 413 and the write queue 414, which buffer multiple commands and memory write requests, respectively. These commands and write requests are stored in order but may be selected and executed out of order to improve performance. The read queue 415 buffers responses to memory read requests prior to their being selected by the arbiter 406 for transmission to the host interface 404.

The scheduler 420 selects memory access requests from the queues 413 and 414 and translates those requests into DRAM command sequences. In some embodiments, the scheduler 420 implements a set of algorithms and heuristics for reordering queued memory access requests. These algorithms and heuristics may consider DRAM structure, such as banks, rows, and columns, as well as region-specific memory timing parameters, in order to generate a schedule of accesses to the memory array 104 (FIG. 1) via the PHY 422. By reordering the memory requests, the scheduler 420 can reduce latency and improve bandwidth of the memory subsystem.
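
Continuing the earlier sketch, the reordering step might be expressed as follows. This is a greedy illustration under stated assumptions: expected latency for a closed row is approximated as tRCD + tCL, any queued request may be issued first, and the Request type plus the reuse of TimingDataStore and region_of from the sketch above are inventions of this example; a production scheduler would also weigh bank state, bus occupancy, and fairness.

```python
from collections import namedtuple

Request = namedtuple("Request", ["address", "is_write"])

def reorder(queue, store, region_of):
    """Order queued requests by region-specific expected latency.

    Sorting is stable, so requests with equal expected latency keep
    their arrival order.
    """
    def latency(request):
        t = store.lookup(region_of(request.address))
        return t.t_rcd + t.t_cl
    return sorted(queue, key=latency)

queue = [Request(0x0100, False), Request(0x4F00, False)]
print([hex(r.address) for r in reorder(queue, store, region_of)])
# ['0x4f00', '0x100']: the faster region-2 request is issued first
```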

Each DRAM operation generates different demands and claims on various DRAM resources, such as the internal banks, address lines, and data lines. When creating a schedule of memory access requests, the scheduler 420 takes into account potential resource conflicts and ensures that the required resources are available at the appropriate times. In addition to considering resource constraints, the scheduler 420 also generates a schedule that conforms to the DRAM timing constraints and timing parameters in order to ensure correct operation. Examples of DRAM timing parameters considered by the scheduler 420 include, but are not limited to: the row to column command delay (tRCD) parameter, the time between column command and data out (tCL) parameter, the time between column commands (tCCD) parameter, the precharge time (tRP) parameter, and the minimum row open time (tRAS) parameter.
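
As one hedged illustration of the resource-conflict point, the sketch below checks a shared data bus for collisions: because reads to different regions return data after different tCL delays, two commands that are each legal in isolation can claim the bus in the same cycle. The single-bus model, the fixed burst length, and the cycle bookkeeping are assumptions of this example rather than details from the disclosure.

```python
BURST_CYCLES = 4  # assumed burst length on the shared data bus

def first_conflict_free_issue(issue_cycle, t_cl, bus_busy):
    """Earliest cycle >= issue_cycle whose data phase avoids the busy set.

    bus_busy is the set of data-bus cycles already claimed by scheduled
    commands; the chosen command's data phase is added before returning.
    """
    cycle = issue_cycle
    while any((cycle + t_cl + i) in bus_busy for i in range(BURST_CYCLES)):
        cycle += 1
    bus_busy.update(cycle + t_cl + i for i in range(BURST_CYCLES))
    return cycle

busy = set()
print(first_conflict_free_issue(0, t_cl=6, bus_busy=busy))  # 0: data on cycles 6-9
print(first_conflict_free_issue(1, t_cl=5, bus_busy=busy))  # 5: pushed back, data on 10-13
```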

In conventional systems, these memory timing parameters are typically loaded into the memory controller and applied for scheduling memory access requests without specific regard to the region of memory being accessed. However, as described above, the memory controller 102, in some embodiments, considers the memory timing parameters on a region-by-region basis when scheduling memory access requests. To this end, the profiling logic 410 implements logic to test and measure one or more of the memory timing parameters for each region of a set of regions of the memory array 104. For a DRAM implementation, the DRAM timing parameters measured may include, but are not limited to, tRCD (row to column command delay), tCL (time between column command and data out), tCCD (time between column commands), tRP (precharge time), tRAS (minimum row open time), tFAW (multi-bank activation window), tWTR (time between read and write), tWR (write recovery time), and the like. For arrays utilizing memory technologies other than DRAM, the timing information may include other timing parameters as appropriate for those types of memory. The profiling logic 410 may accomplish measurement of timing parameters by generating sequences of memory read and memory write commands and observing the results while varying the parameter being tested. In the depicted example, the profiling logic 410 generates sequences of commands and sends them to the arbiter 406 to be eventually placed into the queues 413 and 414. In other embodiments, the memory controller 102 may implement separate queues for the testing sequences generated by the profiling logic 410. The regions of the memory array 104 may be cell-based, row-based, bank-based, rank-based, chip-based, and the like. Moreover, the profiling logic 410 may maintain different levels of region granularity. For example, the profiling logic 410 may determine DRAM timing parameters that apply to the whole of one bank, but determine DRAM timing parameters on a sub-region basis for another bank (e.g., on a row-by-row basis for the other bank). In such instances, a sub-region is a smaller unit of granularity compared to the corresponding region (e.g., a sub-region of a bank region could be a row, a sub-region of a row region could be a bit cell or a subset of the bit cells of the row, etc.). Moreover, the granularity of the memory region may depend on the memory timing parameter. For example, the profiling logic 410 may track the tRP parameters on a bank-by-bank basis while tracking the tRCD parameters on a row-by-row basis.
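
A minimal sketch of that mixed granularity follows, assuming tRP is tracked per bank and tRCD per (bank, row) pair; the table keys and cycle counts are invented for illustration.

```python
# Separate tables, one per parameter, each at its own granularity.
timing_tables = {
    "tRP":  {},  # bank index -> cycles (bank granularity)
    "tRCD": {},  # (bank index, row index) -> cycles (row granularity)
}

timing_tables["tRP"][0] = 10        # bank 0 precharges in 10 cycles
timing_tables["tRCD"][(0, 5)] = 4   # bank 0, row 5 is a fast row
timing_tables["tRCD"][(0, 6)] = 5   # bank 0, row 6 needs the full delay

def t_rcd(bank, row, worst_case=5):
    """Row-granular lookup with an assumed conservative fallback."""
    return timing_tables["tRCD"].get((bank, row), worst_case)

print(t_rcd(0, 5), t_rcd(0, 7))  # 4 5
```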

The region-by-region timing parameter analysis may be triggered by any of a variety of trigger events. For example, the profiling logic 410 may include one or more static or configurable trigger timers to initiate periodic reevaluation. As another example, the trigger events can include certain system events or signaling directed by other system components, such as reevaluation signaling initiated by an operating system, hypervisor, basic input output system (BIOS), firmware, or a combination thereof. To illustrate, at system boot-up the BIOS may direct the profiling logic 410 to perform an initial set of measurements early in the boot process and then periodically at some pre-determined interval. In another example, detection of a correctable or uncorrectable memory error may cause the operating system or hypervisor to trigger the profiling logic 410. As circuit operation often is influenced by temperature, the profiling logic 410 may use a temperature threshold or a temperature change threshold trigger to initiate reevaluation of the memory timing parameters.

The timing data store 412 receives and stores the DRAM timing parameters generated by the profiling logic 410 so as to be subsequently accessible by the scheduler 420. In some embodiments, this timing information may be stored as the actual value of the parameter. For example, the tRP parameter stored for a given row may be the actual number of cycles required to precharge the given row. Alternatively, the timing information may be categorized into several levels or bins, and an indication of the bin number is stored. For example, the tRP parameter could be classified into four levels or bins, such as: <=7 cycles; 8-9 cycles; 10-11 cycles; and 12-13 cycles.
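
The binning scheme can be expressed compactly; the sketch below uses the four bins from the example above, with the bin edges as its only inputs. The function and names are illustrative assumptions, not part of the disclosure.

```python
import bisect

TRP_BIN_UPPER_EDGES = [7, 9, 11, 13]  # bins: <=7, 8-9, 10-11, 12-13 cycles

def trp_bin(cycles):
    """Return the 1-based bin number for a measured tRP value."""
    index = bisect.bisect_left(TRP_BIN_UPPER_EDGES, cycles)
    if index == len(TRP_BIN_UPPER_EDGES):
        raise ValueError("measured tRP is above the highest bin")
    return index + 1

assert trp_bin(7) == 1 and trp_bin(8) == 2 and trp_bin(12) == 4
```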

The timing data store 412 may be implemented by any number of storage mechanisms, including but not limited to Bloom filters, content addressable memories (CAMs), caches, registers, and the like. The storage mechanism may be chosen based on the size and number of stored parameters in order to reduce storage space. In some embodiments, the timing data store 412 may implement different types of storage (Bloom filters, CAM arrays, and such) for each timing parameter tracked by the profiling logic 410. To illustrate, a Bloom filter may be used to store a binned tRP value as described above. One Bloom filter is implemented for each tRP bin: Bloom filter 1 represents bin 1, Bloom filter 2 represents bin 2, and so on. In operation, the profiling logic 410 determines that the tRP for, say, row 5 is classified into the second bin. The profiling logic 410 sends the row number and bin number to the timing data store 412. The timing data store 412 enters the index of row 5 into the second tRP Bloom filter. Later, in subsequent operations, the scheduler 420 may consult the timing data store 412 to determine the tRP for row 5. It sends the row number (5) and an indication of the requested parameter (tRP) to the timing data store 412. The timing data store 412 consults the tRP Bloom filters and returns the tRP bin for row 5 back to the scheduler 420.
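
The following sketch mirrors that flow with one Bloom filter per tRP bin. The filter sizing, the hash construction, and the slowest-first query order are assumptions of this example; querying slowest-first means a false positive can only return a slower, and therefore still functionally safe, timing bin.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter over an integer bit field; sizing is illustrative."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def may_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

# One filter per tRP bin (bin 1 fastest ... bin 4 slowest).
trp_filters = {b: BloomFilter() for b in (1, 2, 3, 4)}
trp_filters[2].add(5)  # the profiler classified row 5 into the second bin

def trp_bin_of(row):
    # Query slowest-first so a false positive can only yield a slower,
    # conservative bin rather than an unsafely fast one.
    for bin_number in (4, 3, 2, 1):
        if trp_filters[bin_number].may_contain(row):
            return bin_number
    return 4  # unknown rows fall back to the most conservative bin

print(trp_bin_of(5))  # 2, absent false positives in the slower bins
```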

With the region-by-region memory timing parameters identified and stored in the timing data store 412, the scheduler 420 may utilize the stored region-based memory timing parameters to more optimally schedule memory access requests based on the regions they target. Thus, in response to receiving a memory access request to a target address, the scheduler 420 accesses from the timing data store 412 the DRAM timing parameters associated with the region containing the target address, and uses the accessed DRAM timing parameters rather than (or in addition to) memory-wide timing parameters to generate more optimized memory access schedules. For example, the tRCD (row to column command delay) timing parameter is a fundamental portion of the overall latency of a DRAM read operation. tRCD represents the minimum delay required between an 'activate row' DRAM command and the subsequent 'column read' DRAM command. Suppose that for a given bank tRCD is 5 cycles for some rows and is 4 cycles for other, faster rows. A conventional memory controller would use the most conservative timing of tRCD=5 for all rows. In contrast, having stored data representative of tRCD for each row in the timing data store 412, the scheduler 420 can utilize tRCD=4 for the faster rows, thus reducing by one cycle the latency of a DRAM read operation to those rows.

FIG. 5 is a flow diagram illustrating a method 500 of profiling a memory performance parameter on a region-by-region basis in accordance with some embodiments. For ease of illustration, the method 500 is described in the example context of the memory controller 102 as shown in FIG. 4, and for determining the tCL memory timing parameter as specified for DRAM architectures. A similar process may be implemented to measure other memory timing parameters, such as the tRCD, tCCD, tRP, tRAS, tFAW, tWTR, and tWR parameters.

At block 502, the profiling logic 410 initiates the region-by-region memory profiling in response to a trigger event, such as an interval timeout or other system event as described earlier. At block 504, the profiling logic 410 selects a DRAM row buffer of a DRAM array 106 (FIG. 1) to be tested and sets the tCL parameter for the selected DRAM row buffer to the maximum value. At block 506, the profiling logic 410 selects a first test data pattern to be used to verify that the row buffer functions correctly at a given tCL value. The data test patterns may include any number of well-known test patterns used to detect memory errors, such as 'all ones', 'all zeros', 'alternating ones and zeros', 'walking ones', and the like. The data test patterns may further include patterns specifically designed for the particular implementation of the DRAM array 106. At block 508, the profiling logic 410 generates a memory write request to an address corresponding to the row buffer under test. The write request is transmitted to the scheduler 420 (FIG. 4) via the write queue 414 (FIG. 4) and the arbiter 406 (FIG. 4). At block 510, the profiling logic 410 generates a memory read request to the same address under the set tCL value. The read request is similarly transmitted to the scheduler 420 via the read queue 415 (FIG. 4). At block 512, the profiling logic 410 compares the data read back from the row buffer to the test pattern. If the data read back does not match, at block 514 the profiling logic 410 stores the last known good tCL in the timing data store 412. As noted above, the timing data store 412 may be implemented as a Bloom filter in order to reduce storage requirements. Otherwise, if the data read back matches, testing proceeds to block 516, whereupon the test process loops back and selects the next test pattern. The process of blocks 506-516 repeats with each selected test pattern until the last pattern is tested (block 518). If all test patterns have passed, at block 520 the profiling logic 410 selects the next lower value for the tCL parameter and the process of blocks 506-518 is repeated for this value of the tCL parameter. This process continues until a failing value for the tCL parameter has been detected at block 512, at which point the last known good tCL value is stored as the row-specific tCL parameter for the row under test at block 522. Thus, at the completion of the method 500, the timing data store 412 has been updated with the fastest tCL value for the particular row under test. Although the method measures tCL for one row buffer for ease of illustration, in practice the method 500 would be repeated to measure tCL for any number of memory regions. Accordingly, the method 500 may be repeated for the next row under test, and so forth, until the fastest tCL value has been identified and stored for each of the rows of a portion or the entirety of the memory array 104.
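
The search loop of the method 500 can be summarized as follows, assuming stand-in write_row/read_row hooks in place of the real write queue 414 and read queue 415 path; the pattern list, the cycle bounds, and the row model are illustrative assumptions.

```python
TEST_PATTERNS = [0x0000, 0xFFFF, 0xAAAA, 0x5555]  # zeros, ones, alternating
TCL_MAX, TCL_MIN = 8, 3  # assumed search bounds, in cycles

def profile_tcl(write_row, read_row):
    """Walk tCL downward; return the last value at which all patterns pass."""
    last_good = None
    for tcl in range(TCL_MAX, TCL_MIN - 1, -1):
        for pattern in TEST_PATTERNS:
            write_row(pattern)
            if read_row(tcl) != pattern:  # block 512: mismatch detected
                return last_good          # blocks 514/522: keep last good tCL
        last_good = tcl                   # all patterns passed at this tCL
    return last_good

# Hypothetical row model: reads are reliable only down to tCL = 5.
state = {}
def write_row(pattern): state["data"] = pattern
def read_row(tcl): return state["data"] if tcl >= 5 else ~state["data"]

print(profile_tcl(write_row, read_row))  # 5
```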

With the memory timing parameters identified for a given region and stored in the timing data store 412, the scheduler 420 may utilize the stored region-based memory timing parameters to more optimally schedule memory access requests. The resulting improvement in memory latency and bandwidth may be better understood by examining a timing diagram for a DRAM 'read' operation.

FIG. 6 is a timing diagram illustrating an example use of profiled region-based memory timing parameters by the memory controller 102 (FIG. 4) to schedule memory access requests so as to improve memory latency. For ease of illustration, the scheduling and timing of a DRAM read operation is shown using the profiled values of tRCD and tCL; however, the memory controller may use any one of, or a combination of, the profiled DRAM timing parameters to schedule commands accordingly. The overall latency of a read operation for a previously unopened row is the sum of the tRCD and tCL timing parameters. tRCD represents the minimum delay required between an 'activate row' DRAM command and the subsequent 'column read' DRAM command. tCL represents the minimum delay between the 'column read' command and the availability of the first word of data being returned.

FIG. 6 compares the timing for two separate DRAM read operations: the first read operation is for location 'X' and the second is for location 'Y'. The profiling logic 410 (FIG. 4) is assumed to have previously measured and stored the tRCD and tCL timing parameters for various memory regions, including the DRAM rows 'X' and 'Y' used in the example, whereby for row 'X' the tRCD parameter is set to 6 cycles (tRCD(X)=6) and the tCL parameter is set to 6 cycles (tCL(X)=6), and for row 'Y' the tRCD parameter is set to 4 cycles (tRCD(Y)=4) and the tCL parameter is set to 5 cycles (tCL(Y)=5).

At clock cycle T₀, the memory controller 102 issues an 'activate row' command 600 to row X of the memory array 104 (FIG. 1). Six cycles later, at clock cycle T₆, the memory controller 102 issues a 'column read' command 602 using the profiled value of tRCD(X)=6 for row 'X'. Next, using the profiled value of tCL(X)=6 for row 'X', the memory controller waits six cycles for the memory data 604 to be returned at clock cycle T₁₂. Thus the total read latency for location X is 12 cycles. Contemporaneously, at clock cycle T₀, the memory controller 102 issues an 'activate row' command 606 for row Y. Using the profiled value tRCD(Y)=4 for row 'Y', the memory controller 102 issues a 'column read' command 608 four cycles later at clock cycle T₄. Next, using the profiled value of tCL(Y)=5 for row 'Y', the memory controller 102 waits five cycles for the memory data 610 to be returned at clock cycle T₉, for a total of 9 cycles of latency.
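
Recomputing the example's arithmetic, the unopened-row read latency is simply tRCD + tCL per row; the short sketch below reproduces the 12-cycle and 9-cycle figures from FIG. 6 using the profiled values stated above.

```python
profiled = {"X": {"tRCD": 6, "tCL": 6}, "Y": {"tRCD": 4, "tCL": 5}}

for row, t in profiled.items():
    latency = t["tRCD"] + t["tCL"]  # unopened-row read: activate -> data out
    print(f"row {row}: data returned at cycle T{latency}")
# row X: data returned at cycle T12
# row Y: data returned at cycle T9
```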

If the memory controller 102 were to use the conventional worst-case timing parameters across the entire memory, the tRCD parameter for the entire memory would have been set to 6, and thus the 'column read' command 608 would not have been issued until 6 cycles after the 'activate row' command 606. Likewise, the memory controller 102 would have also assumed that the memory data 610 would not be returned until 6 cycles after the 'column read' command 608 was issued. However, by using the profiling logic 410 to exploit the variations in the tRCD and tCL timing parameters on a row-by-row basis, the memory controller 102 is able to issue the 'column read' command 608 to row Y two cycles earlier and access the resulting memory data 610 one cycle earlier, reducing the latency of the memory data 610 returned for the 'read' command 608 to row Y by three cycles and thereby increasing the overall processing performance of the system.

In some embodiments, the apparatus and techniques described above are implemented in a system comprising one or more integrated circuit (IC) devices (also referred to as integrated circuit packages or microchips), such as the processing systems of FIGS. 1-3. Electronic design automation (EDA) and computer aided design (CAD) software tools may be used in the design and fabrication of these IC devices. These design tools typically are represented as one or more software programs. The one or more software programs comprise code executable by a computer system to manipulate the computer system to operate on code representative of circuitry of one or more IC devices so as to perform at least a portion of a process to design or adapt a manufacturing system to fabricate the circuitry. This code can include instructions, data, or a combination of instructions and data. The software instructions representing a design tool or fabrication tool typically are stored in a computer readable storage medium accessible to the computing system. Likewise, the code representative of one or more phases of the design or fabrication of an IC device may be stored in and accessed from the same computer readable storage medium or a different computer readable storage medium.

A computer readable storage medium may include any storage medium, or combination of storage media, accessible by a computer system during use to provide instructions and/or data to the computer system. Such storage media can include, but are not limited to, optical media (e.g., compact disc (CD), digital versatile disc (DVD), Blu-Ray disc), magnetic media (e.g., floppy disc, magnetic tape, or magnetic hard drive), volatile memory (e.g., random access memory (RAM) or cache), non-volatile memory (e.g., read-only memory (ROM) or Flash memory), or microelectromechanical systems (MEMS)-based storage media. The computer readable storage medium may be embedded in the computing system (e.g., system RAM or ROM), fixedly attached to the computing system (e.g., a magnetic hard drive), removably attached to the computing system (e.g., an optical disc or Universal Serial Bus (USB)-based Flash memory), or coupled to the computer system via a wired or wireless network (e.g., network accessible storage (NAS)).

FIG. 7 is a flow diagram illustrating an example method 700 for the design and fabrication of an IC device implementing one or more aspects described above. As noted above, the code generated for each of the following processes is stored or otherwise embodied in computer readable storage media for access and use by the corresponding design tool or fabrication tool.

At block 702, a functional specification for the IC device is generated. The functional specification (often referred to as a microarchitecture specification (MAS)) may be represented by any of a variety of programming languages or modeling languages, including C, C++, SystemC, Simulink, or MATLAB.

At block 704, the functional specification is used to generate hardware description code representative of the hardware of the IC device. In some embodiments, the hardware description code is represented using at least one Hardware Description Language (HDL), which comprises any of a variety of computer languages, specification languages, or modeling languages for the formal description and design of the circuits of the IC device. The generated HDL code typically represents the operation of the circuits of the IC device, the design and organization of the circuits, and tests to verify correct operation of the IC device through simulation. Examples of HDL include Analog HDL (AHDL), Verilog HDL, SystemVerilog HDL, and VHDL. For IC devices implementing synchronized digital circuits, the hardware description code may include register transfer level (RTL) code to provide an abstract representation of the operations of the synchronous digital circuits. For other types of circuitry, the hardware description code may include behavior-level code to provide an abstract representation of the circuitry's operation. The HDL model represented by the hardware description code typically is subjected to one or more rounds of simulation and debugging to pass design verification.

After verifying the design represented by the hardware description code, at block 706 a synthesis tool is used to synthesize the hardware description code to generate code representing or defining an initial physical implementation of the circuitry of the IC device. In some embodiments, the synthesis tool generates one or more netlists comprising circuit device instances (e.g., gates, transistors, resistors, capacitors, inductors, diodes, etc.) and the nets, or connections, between the circuit device instances. Alternatively, all or a portion of a netlist can be generated manually without the use of a synthesis tool. As with the hardware description code, the netlists may be subjected to one or more test and verification processes before a final set of one or more netlists is generated.

Alternatively, a schematic editor tool can be used to draft a schematic of circuitry of the IC device, and a schematic capture tool then may be used to capture the resulting circuit diagram and to generate one or more netlists (stored on a computer readable medium) representing the components and connectivity of the circuit diagram. The captured circuit diagram may then be subjected to one or more rounds of simulation for testing and verification.

At block 708, one or more EDA tools use the netlists produced at block 706 to generate code representing the physical layout of the circuitry of the IC device. This process can include, for example, a placement tool using the netlists to determine or fix the location of each element of the circuitry of the IC device. Further, a routing tool builds on the placement process to add and route the wires needed to connect the circuit elements in accordance with the netlist(s). The resulting code represents a three-dimensional model of the IC device. The code may be represented in a database file format, such as, for example, the Graphic Database System II (GDSII) format. Data in this format typically represents geometric shapes, text labels, and other information about the circuit layout in hierarchical form.

At block 710, the physical layout code (e.g., GDSII code) is provided to a manufacturing facility, which uses the physical layout code to configure or otherwise adapt fabrication tools of the manufacturing facility (e.g., through mask works) to fabricate the IC device. That is, the physical layout code may be programmed into one or more computer systems, which may then control, in whole or part, the operation of the tools of the manufacturing facility or the manufacturing operations performed therein.

Note that not all of the activities or elements described above in the general description are required, that a portion of a specific activity or device may not be required, and that one or more further activities may be performed, or elements included, in addition to those described. Still further, the order in which activities are listed is not necessarily the order in which they are performed.

Also, the concepts have been described with reference to specific embodiments. However, one of ordinary skill in the art appreciates that various modifications and changes can be made without departing from the scope of the present disclosure as set forth in the claims below. Accordingly, the specification and figures are to be regarded in an illustrative rather than a restrictive sense, and all such modifications are intended to be included within the scope of the present disclosure.

Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any feature(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as a critical, required, or essential feature of any or all the claims.

1-20. (canceled)
21. A method comprising: identifying region-specific memory timing parameters for a first and second region of a memory, wherein the region-specific memory timing parameters for the first region are independent of the region-specific memory timing parameters for the second region of the memory; and in response to receiving memory access requests to target addresses within the first and second regions, reordering the memory access requests based on the region-specific memory timing parameters such that the sequencing of the memory accesses reflects the region-specific memory timing parameters.
22. The method of claim 21, further comprising: storing data representative of the identified region-specific memory timing parameters to a timing data store.
23. The method of claim 22, wherein the timing data store comprises a first cache on a logic die and a second cache on a memory die.
24. The method of claim 21, further comprising: re-identifying the region-specific memory timing parameters in response to at least one of: a power-on reset; a lapse of a timer; a sensed change in temperature; and a trigger from a controller.
25. The method of claim 21, wherein identifying the region-specific memory timing parameters for each of the first and second regions comprises: measuring a type of memory timing parameter of a plurality of types of memory timing parameters for each sub-region of a plurality of sub-regions of each region; and identifying the slowest measured memory timing parameter corresponding to each measured type of memory timing parameter of the sub-regions as the memory timing parameter for each corresponding measured type of memory timing parameter for each region.
26. The method of claim 21, wherein: the memory comprises a dynamic random access memory (DRAM); and the region-specific memory timing parameters comprise at least one of: a row to column command delay (tRCD); a time between column command and data out (tCL); a time between column commands (tCCD); a precharge time (tRP); a row open time (tRAS); a multi-bank activation window (tFAW); a time between read and write (tWTR); and a write recovery time (tWR).
27. An integrated circuit (IC) device comprising: a memory controller coupled to a memory, the memory controller configured to: reorder memory accesses to a first region and a second region of a plurality of regions of the memory based on a first set of one or more memory timing parameters specific to the first region and a second set of one or more memory timing parameters specific to the second region in response to receiving memory access requests targeting addresses associated with the first region and the second region, wherein the memory controller is to reorder memory accesses such that the sequencing of the memory accesses reflects the memory timing parameters specific to the first region and the second region; and access data at the first region and the second region in response to the reordered memory accesses.
28. The IC device of claim 27, further comprising: a timing data store that maintains data representative of the first set of one or more memory timing parameters; and wherein, in response to a memory access to be performed with respect to the first region, the memory controller is to access the data representative of the first set of one or more memory timing parameters specific to the first region and to schedule the memory access based on the accessed data.
29. The IC device of claim 28, wherein the timing data store comprises a first cache on a logic die and a second cache on a memory die.
30. The IC device of claim 28, further comprising: profiling logic to: identify the first set of one or more memory timing parameters specific to the first region; and identify the second set of one or more memory timing parameters specific to the second region.
31. The IC device of claim 30, further comprising: a stacked-die memory device comprising: a set of one or more stacked memory dies comprising the memory; and a set of one or more logic dies electrically coupled to the stacked memory dies, the set of one or more logic dies comprising the memory controller and the profiling logic.
32. The IC device of claim 31, wherein the set of one or more stacked memory dies and the set of one or more logic dies are disposed in a stacked configuration whereby the set of one or more logic dies is connected to the set of one or more stacked memory dies via a set of through silicon vias.
33. The IC device of claim 31, wherein the set of one or more logic dies implements the timing data store as a Bloom filter.
34. The IC device of claim 31, wherein the set of one or more stacked memory dies implements the timing data store.
35. A non-transitory computer readable medium embodying a set of executable instructions, the set of executable instructions to manipulate one or more processors to: identify a first set of one or more memory timing parameters specific to a first region of a plurality of regions of a stacked-die memory device; identify a second set of one or more memory timing parameters specific to a second region of the plurality of regions of the stacked-die memory device, wherein the second set of one or more memory timing parameters is independent of the first set of one or more memory timing parameters; and in response to receiving a first memory access request to access the first region and a second memory access request to access the second region, reorder the first memory access request and the second memory access request such that the sequencing of memory accesses reflects the memory timing parameters specific to the first region and the second region.
36. The non-transitory computer readable medium of claim 35, wherein the set of executable instructions is to manipulate the one or more processors to: store data representative of the memory timing parameters for each of the first region and the second region at a timing data store; and in response to receiving the first memory access request to access the first region and the second memory access request to access the second region, access, from the timing data store, first data representing the first set of one or more memory timing parameters and second data representing the second set of one or more memory timing parameters.
37. The non-transitory computer readable medium of claim 36, wherein the timing data store comprises a first cache on a logic die and a second cache on a memory die.
38. The non-transitory computer readable medium of claim 36, wherein the timing data store is implemented as a Bloom filter.
39. The non-transitory computer readable medium of claim 35, wherein the set of executable instructions is to manipulate the one or more processors to: identify at least one of the first set of one or more memory timing parameters for the first region and the second set of one or more memory timing parameters for the second region in response to at least one of: a power-on reset; a lapse of a timer; a sensed change in temperature; and a trigger from a controller.
40. The non-transitory computer readable medium of claim 35, wherein the set of executable instructions is to manipulate the one or more processors to: measure a type of memory timing parameter of a plurality of types of memory timing parameters for each sub-region of a plurality of sub-regions of the first region; and identify the slowest measured memory timing parameter corresponding to each measured type of memory timing parameter of the sub-regions as the memory timing parameter for each corresponding measured type of memory timing parameter for the first region.